Abstract

A probabilistic model for the computer-based generation of a machine
translation system from English-Russian parallel text corpora is
suggested. The model is trained on parallel text corpora with pre-aligned
source and target sentences. Training yields a bilingual dictionary of words
and "word blocks" with the relevant translation probabilities.

1 Introduction.

Corpus-based statistical MT is gaining popularity nowadays owing to the vastly
increased capacity of modern computers. The works of P. Brown and
collaborators [1, 2] may be regarded as a typical recent example.

This paper suggests an approach to statistical MT different from that
of Brown et al. The suggested model is trained on pre-aligned bilingual text
corpora, and the following approach to 'tailor-making' a computer dictionary
and an MT system is taken. The translation of a source word combination
by a target one is determined by the correlation with the neighboring word
combinations in both the source and the target texts, rather than only by the
translation probabilities of the combinations themselves.

The word orders of the source and target sentences seldom coincide; however,
a raw translation with an incorrect word order may often be understood
by a specialist. The translation quality will improve radically if, instead of
individual words, one takes internally agreed word combinations with fixed order
(blocks).

In this model statistically stable source blocks are related to the most
probable target ones using a specially introduced function, called the "adhesion
function", since it is believed that this function indirectly reflects the
grammatical and semantic "adhesion" of the words in a text. We believe that,
once blocks with negative correlation have been excluded, the remaining
internally agreed blocks will to some extent substitute for the proper word
order in the target sentence.
2 Probability Assessment and Model Training.

The training corpus is presumed to be pre-aligned, i.e. divided into matching
pairs of source and target sentences¹. Each sentence of a pair is broken into
word sequences (blocks) in such a way that the order the words had in the
sentence is preserved.

The first block comprises the first word of each of the matching sentences; then
one word is added each time until the block reaches the maximum length². Then the
procedure is repeated starting from the second word, and so on. All the blocks
obtained in this manner are stored in a temporary data file, with blocks
that appear several times being regarded at this stage as different (see Fig. 1).

Fig. 1. Diagram illustrating the breaking of the sentence into blocks.
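
The block-extraction step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `extract_blocks`, whitespace tokenisation, and the default limit of three words are assumptions made for the example.

```python
from typing import List, Tuple


def extract_blocks(sentence: str, max_len: int = 3) -> List[Tuple[str, ...]]:
    """Return every contiguous word sequence of 1..max_len words, in sentence order.

    Hypothetical helper: whitespace tokenisation is an assumption of this sketch.
    """
    words = sentence.split()
    blocks = []
    for start in range(len(words)):
        # Grow the block one word at a time until it reaches the length limit
        # or runs past the end of the sentence.
        for length in range(1, max_len + 1):
            if start + length > len(words):
                break
            blocks.append(tuple(words[start:start + length]))
    return blocks


blocks = extract_blocks("a b c d", max_len=3)
```

Repeated blocks are deliberately kept as separate entries, since the text treats them as different at this stage.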

We suggest two alternative procedures for sorting out the preliminary
data file to obtain the translation dictionary.

In a sentence of length $L$, the number of $b$-word blocks is $L - b + 1$. The
total number of blocks whose length does not exceed $l$ in this sentence is

$$N_l = \sum_{b=1}^{l} (L - b + 1) = \frac{l}{2}\,(2L - l + 1). \tag{1}$$

The number of block pairs is

$$W_l = \frac{l}{2}\left[2L^{(S)} - l + 1\right] \cdot \frac{l}{2}\left[2L^{(T)} - l + 1\right], \tag{2}$$

where $L^{(S)}$ and $L^{(T)}$ are the lengths of the source and target sentences,
respectively.

Even in texts where the source and target sentences are fairly long
(say $L^{(S)} = L^{(T)} = 20$),

$$\frac{W_{20}}{W_{3}} = \left(\frac{210}{57}\right)^{2} \approx 13.6.$$

We see that the whole volume of block pairs is less than 14 times larger than the
number of block pairs with length not exceeding $l = 3$.
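
The counting argument can be checked numerically. The sketch below assumes only what the text states (a sentence of L words contains L - b + 1 blocks of b words); the function names are hypothetical.

```python
def blocks_up_to(L: int, l: int) -> int:
    """Number of blocks of length 1..l in a sentence of L words (assumes l <= L)."""
    return sum(L - b + 1 for b in range(1, l + 1))


def closed_form(L: int, l: int) -> int:
    """Closed form l(2L - l + 1)/2 of the same sum; the product is always even."""
    return l * (2 * L - l + 1) // 2


def block_pairs(L_src: int, L_tgt: int, l: int) -> int:
    """Number of source/target block pairs when both sides are limited to l words."""
    return blocks_up_to(L_src, l) * blocks_up_to(L_tgt, l)


# Ratio of all block pairs (l = 20) to pairs of blocks of at most three words.
ratio = block_pairs(20, 20, 20) / block_pairs(20, 20, 3)
```

For two 20-word sentences this gives 57 blocks per sentence at the three-word limit and 210 blocks without a limit, so the ratio of pair counts stays below 14.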

1 In the training corpus as well as in the translated texts the ends of sentences are presumed 
to be marked with common punctuation marks. 

2 In our case the maximum length limit of a block is three words.
2.0.1 a) "All-in" Relations Alternative.

1. In this case the sentence pairs are broken into arbitrary numbers $v_S$ and
$v_T$ of blocks.

2. If the number of different blocks obtained after all possible divisions
of one sentence is greater than the number of blocks obtained in its
counterpart, the latter is padded with blank blocks until the number of
blocks in both target and source sentences becomes equal, $w = \max(v_S, v_T)$.

3. Let us relate each block of the source (English) sentence to all of its target
(Russian) counterparts.

4. The resulting $w^2$ pairs $\{S_j T_k\}$, $j, k = 1, 2, \ldots, w$, are stored in the
temporary data file.
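
The four steps of the "all-in" alternative can be sketched as below. The `BLANK` marker and the function name are hypothetical; the text does not say how blank blocks are represented.

```python
from itertools import product
from typing import List, Sequence, Tuple

Block = Tuple[str, ...]
BLANK: Block = ()  # hypothetical representation of a blank block


def all_in_pairs(src_blocks: Sequence[Block],
                 tgt_blocks: Sequence[Block]) -> List[Tuple[Block, Block]]:
    """Pad the shorter block list with blanks, then relate every source block
    to every target block, giving w * w pairs."""
    w = max(len(src_blocks), len(tgt_blocks))
    src = list(src_blocks) + [BLANK] * (w - len(src_blocks))
    tgt = list(tgt_blocks) + [BLANK] * (w - len(tgt_blocks))
    return list(product(src, tgt))


# Toy example with made-up blocks: three source blocks against two target blocks.
pairs = all_in_pairs([("he",), ("is",), ("he", "is")], [("on",), ("zdes",)])
```

Here w = 3, so nine pairs are produced, three of which relate a source block to the blank target block.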



2.0.2 b) Symmetrical Relations Alternative. 

1. In this case the sentence pairs are broken into an equal number $w$ of
blocks, with no blank counterparts. Moreover, only the blocks with the
same value of $j$ are stored in the preliminary data file.

2. Then for each division we have $w$ pairs $\{S_j T_j\}$, $j = 1, 2, \ldots, w$.

3. The resulting $w$ pairs $\{S_j T_j\}$ are stored in the temporary
data file.

The symmetrical alternative requires a bigger training corpus; however,
it allows the same block comparison procedure to be used in training and in
translation.
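
A possible reading of the symmetrical alternative is sketched below. It assumes that a "division" means a split of the sentence into consecutive, non-overlapping blocks; the text does not state this explicitly, and the function names are hypothetical.

```python
from typing import Iterator, List, Tuple

Block = Tuple[str, ...]


def segmentations(words: List[str], w: int, max_len: int = 3) -> Iterator[List[Block]]:
    """Yield every split of `words` into exactly w consecutive blocks of 1..max_len words."""
    if w == 0:
        if not words:
            yield []
        return
    for length in range(1, min(max_len, len(words)) + 1):
        head = tuple(words[:length])
        for rest in segmentations(words[length:], w - 1, max_len):
            yield [head] + rest


def symmetric_pairs(src: List[str], tgt: List[str], w: int,
                    max_len: int = 3) -> List[Tuple[Block, Block]]:
    """Pair blocks with the same index j across every source and target division."""
    pairs = []
    for s_div in segmentations(src, w, max_len):
        for t_div in segmentations(tgt, w, max_len):
            pairs.extend(zip(s_div, t_div))
    return pairs


divisions = list(segmentations(["a", "b", "c"], 2))
toy_pairs = symmetric_pairs(["he", "sleeps"], ["on", "spit"], 2)
```

Because only same-index pairs are kept, each pair of divisions contributes w pairs instead of the w squared pairs of the "all-in" alternative.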

In both alternatives the procedure of sentence division will be terminated 
when the computer storage capacity is exhausted. 

Let $n = \sum_s w_s^2$ be the total number of matching block pairs in
alternative "a", and $n = \sum_s w_s$ that in alternative "b", where the sum runs
over the sentence pairs. Then the total number of occurrences of a source block
$S = (s_1, s_2, \ldots)$, of a target block $T = (t_1, t_2, \ldots)$ and of the pair
$\{S, T\}$ will be given by $n_S$, $n_T$ and $n_{S \cap T}$, whereas the relevant
probabilities $P^S$, $P^T$ and $P^{S \cap T}$ are found from:

$$P^T = \frac{n_T}{n}; \qquad P^S = \frac{n_S}{n}; \qquad P^{S \cap T} = \frac{n_{S \cap T}}{n}.$$

Then the conditional probability $P(T|S) = P^{T \cap S}/P^S$ of the word $T$ as a
translation of the word $S$ and the probability $P(S|T) = P^{T \cap S}/P^T$ of the word
$S$ as a translation of the word $T$ are related by the Bayes formula

$$P(T|S)\,P^S = P^{T \cap S} = P(S|T)\,P^T. \tag{4}$$

As is well known, the correlation between the events $S$ and $T$ is

$$C^{T \cap S} = P^{T \cap S} - P^S P^T. \tag{5}$$

When the events $S$ and $T$ are independent, i.e. co-occur at random,
$P^{T \cap S} = P^S P^T$ and the correlation function $C^{T \cap S}$ becomes zero. It may then
be suggested that a negative correlation, $C^{T \cap S} < 0$, arises when the source and
target language words are so eager to avoid each other that their correlated
co-occurrence is less probable than the random one. We regard such co-occurrences
as prohibited by the rules of the languages involved.
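
The probabilities and the correlation defined above can be estimated directly from the stored block pairs. The sketch below is an illustration under simple assumptions: pairs are held in a Python list, and the toy data and the name `correlation_stats` are invented for the example.

```python
from collections import Counter
from typing import Dict, Hashable, List, Tuple


def correlation_stats(pairs: List[Tuple[Hashable, Hashable]]) -> Dict[tuple, Dict[str, float]]:
    """Estimate P(T|S), P(S|T) and the correlation C for every observed (S, T) pair."""
    n = len(pairs)
    n_s = Counter(s for s, _ in pairs)   # occurrences of each source block
    n_t = Counter(t for _, t in pairs)   # occurrences of each target block
    n_st = Counter(pairs)                # co-occurrences of each pair
    stats = {}
    for (s, t), count in n_st.items():
        p_s, p_t, p_st = n_s[s] / n, n_t[t] / n, count / n
        stats[(s, t)] = {
            "P_S": p_s,
            "P_T": p_t,
            "P(T|S)": p_st / p_s,
            "P(S|T)": p_st / p_t,
            "C": p_st - p_s * p_t,       # zero when S and T co-occur at random
        }
    return stats


# Made-up toy data: four stored pairs of one-word blocks.
toy = [("cat", "kot"), ("cat", "kot"), ("dog", "pes"), ("dog", "kot")]
stats = correlation_stats(toy)
```

In this toy sample the pair ("cat", "kot") has positive correlation, while ("dog", "kot") has negative correlation and would be treated as a disfavoured co-occurrence.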

The correlation analysis starts from the minimal, one-word blocks $S = (s_1)$,
$T = (t_1)$. The longer two-word blocks $S = (s_1, s_2)$, $T = (t_1, t_2)$ and three-word
blocks $S = (s_1, s_2, s_3)$, $T = (t_1, t_2, t_3)$ are analysed if

$$C^{(s_1, s_2)} = P^{(s_1, s_2)} - P^{(s_1)} P^{(s_2)} > 0 \tag{6}$$

and, accordingly, if

$$P^{(t_1, t_2, t_3)} - P^{(t_1, t_2)} P^{(t_3)} > 0 \tag{7}$$

or

$$P^{(t_1, t_2, t_3)} - P^{(t_1)} P^{(t_2, t_3)} > 0 \tag{8}$$

and so on.

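To save storage space we shall pay attention only to closely correlated
events, for which the relative correlation³ of two events $A$ and $B$ with joint
probability $P^{AB}$,

$$\rho^{AB} = \frac{P^{AB} - P^{A} P^{B}}{P^{A} P^{B}}, \tag{9}$$

satisfies the conditions

$$\rho^{S_j \cup S_{j+1}} > c_S \quad \text{or} \quad \rho^{S_j \cup S_{j+1}} < -c_S, \tag{10}$$

$$\rho^{T_j \cup T_{j+1}} > c_T \quad \text{or} \quad \rho^{T_j \cup T_{j+1}} < -c_T, \tag{11}$$

$$\rho^{T_j \cap S_k} > c_{TS} \quad \text{or} \quad \rho^{T_j \cap S_k} < -c_{TS}. \tag{12}$$

All pairs $S_j \cup S_{j+1}$, consisting of sub-blocks $S_j$ and $S_{j+1}$, will be included
in the S-dictionary if the sub-blocks $S_j$ and $S_{j+1}$ satisfy condition (10). Similarly,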

3 Absolute correlation value of the blocks divided by the value of their random co-occurrence 
which accounts for the rare but strongly correlated events. 
if sub-blocks $T_j$ and $T_{j+1}$ satisfy condition (11), they are included in a
T-dictionary. And, finally, if condition (12) is satisfied, we include the
$(T_j, S_k)$ pairs in the TS-dictionary. The values of the positive constants $c_S$,
$c_T$ and $c_{TS}$ naturally depend on the computer storage capacity. In this way we
shall be able to calculate both $P(S|T)$ and $P(T|S)$, which will allow the
direction of the translation to be reversed.
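
The dictionary filter can be sketched as follows. Following footnote 3, the relative correlation is taken here as the correlation divided by the probability of random co-occurrence. The two-sided threshold test and the function names are assumptions of this sketch.

```python
def relative_correlation(p_joint: float, p_a: float, p_b: float) -> float:
    """Correlation divided by the probability of random co-occurrence (footnote 3)."""
    random_cooccurrence = p_a * p_b
    return (p_joint - random_cooccurrence) / random_cooccurrence


def keep_entry(p_joint: float, p_a: float, p_b: float, c: float) -> bool:
    """Keep a pair only if it is strongly correlated, positively or negatively.

    `c` plays the role of one of the positive storage-dependent constants.
    """
    return abs(relative_correlation(p_joint, p_a, p_b)) > c


# A rare pair whose parts almost always occur together is kept,
# whereas a frequent pair that co-occurs at random is dropped.
rare_but_strong = keep_entry(p_joint=0.001, p_a=0.001, p_b=0.001, c=1.0)
frequent_but_random = keep_entry(p_joint=0.25, p_a=0.5, p_b=0.5, c=1.0)
```

Dividing by the random co-occurrence probability is what lets rare but strongly correlated events pass the threshold, which the absolute correlation alone would not allow.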

All the elements of a word paradigm enter the dictionaries as separate entries.
Both the selection of a correct form for translation (and the rejection of a
strongly prohibited one) and the agreement between the forms are achievable: on
the one hand, the forms within a block are already agreed and, on the other,
reasonable agreement of paradigm forms in matching blocks is obtained in the
course of maximisation, as described below.

The training may be simplified if we have a dictionary of cognates⁴. In
this case the preliminary data file will not include the pairs in which one block
contains a cognate whereas its counterpart does not⁵. When the dictionary has
been generated (i.e. the available training corpora are exhausted), we pass on
to the translation of a new text.

3 Translation Model Optimisation 

The translation of a new sentence starts by dividing it into blocks. This is
done in such a way that none of the blocks is wholly contained in any
other. To satisfy this condition, each subsequent block begins at least one
word after the first word of the preceding block and ends at least one
word after its last word. Each of the source blocks is then
related to the target ones.
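
The containment rule can be checked mechanically. The sketch below assumes that each block is described by a hypothetical (start, end) pair of word indices, with the end index exclusive, listed in sentence order.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) word indices, end exclusive


def is_valid_division(spans: List[Span]) -> bool:
    """True if every block starts and ends strictly later than the preceding one,
    so that no block is wholly contained in a neighbouring block."""
    return all(
        nxt[0] > prev[0] and nxt[1] > prev[1]
        for prev, nxt in zip(spans, spans[1:])
    )


# Blocks abc, bcd, cdef and gh over the eight-word sentence "a b c d e f g h".
example = [(0, 3), (1, 4), (2, 6), (6, 8)]
valid = is_valid_division(example)
```

Since both the start and the end positions increase strictly, the check on neighbouring blocks is enough to rule out containment between any two blocks in the list.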

The division starts from the blocks of the maximum length available in 
the dictionary, and the block length is gradually decreased to the word-to-word 
pairs. To select the optimal translations we shall use the following maximisation 
procedure. 

For the words in a source (or target) text we suggest the characteristic of
"adhesion". We shall call "adhered" both the words which enter one and the
same block and those entering overlapping blocks. Thus, in Fig. 1 the
words abc, bcd, cdef and gh adhere into blocks, and since the words b, c, d enter
several blocks simultaneously they are also considered adhered. The words ag, ah,
bg, bh and so forth are not adhered. Fig. 1 shows the source sentence only; it
is understood that, for simplicity, the target sentence has the same block
pattern. Naturally, in both texts the blocks with multiple overlapping will be
those having greater adhesion. At the same time, the longer the block, the
smaller its occurrence probability in the dictionary after training. To give
longer and shorter blocks equal opportunities to compete, the following procedure
is suggested. To illustrate this, let us consider blocks of at most two words

4 Cognates are words of similar graphic form in different languages, e.g. syntax
and sintaksis.

5 Identification and use of cognates may be found, e.g., in Q],and Q]. 
Fig. 2. A diagram of a three-word sentence translated by two overlapping blocks.

and assume that a three-word sentence is translated by two linked blocks
(Fig. 2).



Of course, all the words in Fig. 2 are adhered, and the source sentence
cannot be translated by a single target block only because of our two-word block
constraint. We suggest the following model-type relation to compute the true
probability:

$$P^{(t_1, t_2, t_3)} = P^{(t_1, t_2)}\, P^{(t_2, t_3)}\, f\!\left(P^{(t_2)}\right), \tag{13}$$

i.e. we suggest that the ratio of the true probability to the probabilities
of the individual blocks, $P^{(t_1, t_2)}$ and $P^{(t_2, t_3)}$, depends only on the probability
of the overlapping word, $P^{(t_2)}$⁶. Generally speaking, finding the overlapping
probability function $f(P^{(t_2)})$ requires a special phenomenological study, but
for our model we limit ourselves to the following simple considerations. It is
easy to see that if none of the words adheres to the others, then

$$P^{(t_1, t_2, t_3)} = P^{(t_1)}\, P^{(t_2)}\, P^{(t_3)}, \tag{14}$$

and hence, having substituted (14) into (13), we obtain for this very special case:

$$f(P) \approx 1/P. \tag{15}$$



We hope that this approximation will give satisfactory results in the general
case, which is why we assign the factor 1/P each time the words in blocks overlap.
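
The substitution that leads to the factor 1/P can be written out step by step. The lines below are a worked restatement of the argument, under the stated assumption that none of the words adheres to the others, so that the two-word block probabilities also factorise.

```latex
\begin{align*}
P^{(t_1,t_2,t_3)} &= P^{(t_1,t_2)}\, P^{(t_2,t_3)}\, f\!\left(P^{(t_2)}\right)
  && \text{model relation} \\
P^{(t_1)} P^{(t_2)} P^{(t_3)} &= \left[P^{(t_1)} P^{(t_2)}\right]
  \left[P^{(t_2)} P^{(t_3)}\right] f\!\left(P^{(t_2)}\right)
  && \text{no adhesion: all probabilities factorise} \\
f\!\left(P^{(t_2)}\right) &= \frac{1}{P^{(t_2)}}
  && \text{divide both sides by } P^{(t_1)} \bigl[P^{(t_2)}\bigr]^{2} P^{(t_3)}
\end{align*}
```

The overlapping word is counted twice in the product of the two block probabilities, and the factor 1/P removes the extra count.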

The function $f(P)$ is introduced to bring the blocks into accord, and its form is
presumed to be universal for the given language. We shall call it the global
adhesion factor (GAF). A more effective way to account for the overlapping of the
blocks is to introduce a local adhesion factor (LAF) for each word rather than the
GAF:

$$f_{t_2} = \frac{P^{(t_1, t_2, t_3)}}{P^{(t_1, t_2)}\, P^{(t_2, t_3)}}. \tag{16}$$

The LAF $f_{t_2}$ for each $t_2$-word is first computed for all available $P^{(t_1, t_2, t_3)}$,
$P^{(t_1, t_2)}$ and $P^{(t_2, t_3)}$ and then averaged over $t_1$ and $t_3$. In this case $f_{t_2}$
really becomes an inherent characteristic of a $t_2$-word. It is easy to see that in
$P(S|T) = P^{S \cap T}/P^T$ the overlapping of $t$-words is present both in $P^{S \cap T}$ and
in $P^T$; hence,

6 For the sake of simplicity we show the adhesion function only for the target blocks, it is 
understood, however, that similar function is calculated in the same way for the source blocks 
as well. 
the LAF values for $t$-words cancel, and during the translation stage we take
into account only the LAF for $s$-words. Then for a sentence we have:

$$F^{S} = \prod_j f^{S_j \cap S_{j+1}} = \prod_j \prod_{s_k \in S_j \cap S_{j+1}} f_{s_k}, \tag{17}$$

where the product is computed over the overlapping $s$-words.
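
The averaging that defines the local adhesion factor can be sketched as below. The dictionaries `p3` and `p2` of three-word and two-word block probabilities, the toy values and the function name are assumptions made for the illustration.

```python
from collections import defaultdict
from typing import Dict, Tuple


def local_adhesion_factors(p3: Dict[Tuple[str, str, str], float],
                           p2: Dict[Tuple[str, str], float]) -> Dict[str, float]:
    """For each middle word, average P(w1,w2,w3) / (P(w1,w2) * P(w2,w3))
    over all three-word blocks whose two-word sub-blocks are available."""
    samples = defaultdict(list)
    for (w1, w2, w3), p in p3.items():
        left, right = p2.get((w1, w2)), p2.get((w2, w3))
        if left and right:  # skip blocks with a missing or zero sub-block probability
            samples[w2].append(p / (left * right))
    return {word: sum(vals) / len(vals) for word, vals in samples.items()}


# Made-up probabilities for two three-word blocks sharing the middle word "b".
p2 = {("a", "b"): 0.1, ("b", "c"): 0.2, ("x", "b"): 0.1, ("b", "y"): 0.1}
p3 = {("a", "b", "c"): 0.04, ("x", "b", "y"): 0.03}
laf = local_adhesion_factors(p3, p2)
```

With these toy values the two individual ratios for "b" are about 2 and 3, so the averaged factor is about 2.5.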

An overlapping in the source sentence (e.g. $s_2$ in Fig. 2) may be related to
that in the target sentence (e.g. $t_2$ in Fig. 2). During translation, when combining
the target blocks we may get a double occurrence of the overlapping word (e.g., when
combining $T = (t_1, t_2)$ and $T = (t_2, t_3)$ we get a double occurrence of $t_2$); such
occurrences are to be excluded from the translation product. One should exclude
synonyms as well. Having excluded double occurrences we shall obtain a set of
$\mu = 1, 2, \ldots$ translation alternatives $\{T_k^{\mu} \cup T_{k+1}^{\mu}\}$ combining several
neighboring blocks $k$ and $k + 1$, some of which may be grammatically incorrect.
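
Removing the double occurrence of an overlapping word when two neighbouring target blocks are combined can be sketched as below. The function name is hypothetical, and the sketch handles only literal repetition of words, not synonyms.

```python
from typing import Tuple

Block = Tuple[str, ...]


def merge_blocks(left: Block, right: Block) -> Block:
    """Join two blocks, dropping the longest suffix of `left` that is also a
    prefix of `right`, so that overlapping words appear only once."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]
    return left + right  # no overlap: plain concatenation


merged = merge_blocks(("t1", "t2"), ("t2", "t3"))
```

Taking the longest shared suffix and prefix also covers overlaps of more than one word, as happens between three-word blocks.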

We suggest the following correction procedure: 

a) Each of the alternatives $\{T_k^{\mu} \cup T_{k+1}^{\mu}\}$ is broken into all possible sub-blocks;

b) The optimal alternative is obtained by

$$\max_{\mu} \left\{ P^{T_k^{\mu} \cup T_{k+1}^{\mu}} \right\}. \tag{18}$$



We believe that by increasing the length of the blocks we shall be able to
select successfully the translation words corresponding to the source context.
Moreover, one will hardly require fragments longer than four words, since
correlation at such distances seems rather weak.

For the general case of translation probability maximisation we propose the
following:

$$\max \prod_j P^{T_j \cap S_j}\, F^{T_j T_{j+1}},$$

where $P^{T_j \cap S_j} = P^{(t_1, t_2, \ldots) \cap (s_1, s_2, \ldots)}$ is the probability corresponding to
block $j$ in the given translation alternative. The overlapping function
$F^{T_j T_{j+1}}$ for $n_p$-fold overlapping of the words $t_p$ in neighboring blocks $T_j$
and $T_{j+1}$ may be computed as
$F^{T_j T_{j+1}} = \prod_{t_p \in T_j \cap T_{j+1}} \left[P^{(t_p)}\right]^{-n_p}$ (see (17)). The
maximisation procedure can be easily modified for the source language since the
suggested model is evidently symmetrical.
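
One way to compare translation alternatives is sketched below. It is an illustration only: it uses the simple 1/P factor for every overlapping target word, works in log space to avoid underflow, and all names and probabilities are invented for the example.

```python
import math
from typing import Dict, List, Tuple

Block = Tuple[str, ...]
Alternative = List[Tuple[Block, Block]]  # (source block, target block) in sentence order


def overlap_len(left: Block, right: Block) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def log_score(alt: Alternative, p_pair: Dict[Tuple[Block, Block], float],
              p_word: Dict[str, float]) -> float:
    """Sum of log pair probabilities, minus log P(word) for each overlapping target word."""
    total = sum(math.log(p_pair[pair]) for pair in alt)
    for (_, t_prev), (_, t_next) in zip(alt, alt[1:]):
        for word in t_next[:overlap_len(t_prev, t_next)]:
            total -= math.log(p_word[word])  # factor 1/P per overlapping word
    return total


def best_alternative(alts: List[Alternative], p_pair, p_word) -> Alternative:
    """Return the alternative with the highest score."""
    return max(alts, key=lambda a: log_score(a, p_pair, p_word))


# Made-up dictionary entries for a three-word sentence covered by two blocks.
first = (("s1", "s2"), ("t1", "t2"))
p_pair = {first: 0.02,
          (("s2", "s3"), ("t2", "t3")): 0.03,
          (("s2", "s3"), ("t2", "u3")): 0.01}
p_word = {"t2": 0.1}
alt_a = [first, (("s2", "s3"), ("t2", "t3"))]
alt_b = [first, (("s2", "s3"), ("t2", "u3"))]
best = best_alternative([alt_a, alt_b], p_pair, p_word)
```

An exhaustive comparison like this is feasible only for a small number of alternatives; the text leaves the search strategy open.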



4 Conclusions 

Similar to Brown et al., we train our model using parallel text corpora. However,
our model is different in a number of respects. We consider the suggested numerical
correlation between source and target blocks (the simultaneous interpreter
principle) more critical for translation quality than the selection of optimal
word positions through maximisation of the product of the relevant probabilities,
as in [1, 2]. For the model suggested in this paper, there is room for improvement
through increasing the block length, limited only by computing capacity. In the
model of Brown et al., however, it is not clear how, without some new modelling
ideas, to make a probability-based choice between, say, two such sentences as
"He is alive, but she is dead" and "He is dead, but she is alive", both of which
are grammatically correct but semantically opposite.